LipNet: Sentence-level Lipreading

Authors

  • Yannis M. Assael
  • Brendan Shillingford
  • Shimon Whiteson
  • Nando de Freitas
Abstract

Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, an LSTM recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first lipreading model to operate at sentence-level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy.


Similar Articles

LipNet: End-to-End Sentence-level Lipreading

Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end perform only word classification, rat...


Factors predicting postoperative sentence scores in postlinguistically deaf adult cochlear implant patients.

A sample of 64 postlinguistically profoundly to totally deaf adult cochlear implant patients were tested without lipreading by means of the Central Institute for the Deaf (CID) sentence test 3 months postoperatively. Preoperative promontory stimulation results (thresholds, gap detection, and frequency discrimination), age, duration of profound deafness, cause of deafness, lipreading ability, po...


Mean Percent Correct Scores on Closed-Set Feature Discrimination and Open-Set Word and Sentence Tests for Four Children Using Two Different Tactile Encoders in Combination with Lipreading and Aided Residual Hearing

The results of the ABx speech feature discrimination testing indicate that the device is flexible enough to present different speech features, including cues to voicing or vowel formant frequencies (first or second), which can be easily discriminated in the tactile display presented to users without additional cues from lipreading or aided residual hearing. The drop in perception of vowel F2 ...


Using Surface-Learning to improve Speech Recognition with Lipreading

We explore multimodal recognition by combining visual lipreading with acoustic speech recognition. We show that combining the visual and acoustic clues of speech improves the recognition performance significantly, especially in noisy environments. We achieve this with a hybrid speech recognition architecture, consisting of a new visual learning and tracking mechanism, a channel robust acoustic ...


Speed of Processing Phonological Information Presented Visually and Speechreading Proficiency

An experiment was designed to investigate whether the speed of processing phonological information presented visually is related to speechreading proficiency even after the proportion of the variance associated with an analytical consonantal viseme recognition task was taken into consideration. Young adults (n = 48) with normal hearing and vision completed 4 different visual-speech perception t...



Journal:
  • CoRR

Volume abs/1611.01599  Issue -

Pages -

Publication date 2016